The Ptilde™ Scripting Language
P~ (pronounced "p-tilde") is an open-source, general-purpose, Java-like scripting language and regular expression engine offered as a Java library. One of the primary reasons to adopt P~ is to solve difficult matching, query, and transformation problems. Perl experts will discover that P~ offers certain regex powers not found in Perl, so that for many types of hard problems a P~ solution consists of one suitably composed, single-pass regex. This is even more the case in comparison to java.util.regex, because Java regex tries to emulate the Perl regex features but has not built this power into the Java grammar itself. The following table examines some of the important differences between P~ regular expressions and those of Perl and java.util.regex, as Perl has set the current standard for regex.
| Feature | Ptilde | Perl | java.util.regex |
|---|---|---|---|
| Code-insertion (side-effects) | Called the DoPattern in P~. Statements wrap a regex and fire after the match if and only if the wrapped regex is part of the total match. | Called a code-assertion in Perl. Statements fire during the matching process, often even if nearby characters are ultimately not a match. Therefore less useful because of unexpected "firing". | Does not exist in any form; the programmer is forced to intertwine fine-grained regexes with functional logic. |
| Named-capture | All capture expressions at any nesting level must specify a String reference by scoped name. Readable, and powerfully precise match-capture placement when combined with DoPattern and rule parameters. | Capture only to special variables $1, $2, etc. Much more difficult to control placement of submatch. | Similar to Perl with $1, $2, etc. Programmer forced to intertwine fine-grained regexes with functional logic. |
| Nested Capture | For example, capturing sub-regex matches to appropriate location in data structure, easy by combining DoPattern and CapturePattern with variables scoped to the nearest DoPattern. | Possible with simpler problems using code-assertions, but not arbitrarily possible, and very cumbersome. | Not possible. Programmer forced to use loops to aggregate document matches. |
| Rules (Pattern Functions) | Functions that return Pattern (i.e. a regex) bind the argument value at time of call to the regex returned. Thus "pattern functions" can embed side-effects relative to arguments. Aids in reusability of design patterns. | Not available. | Not available. |
| Polymorphic Rules | Put matching behavior in virtual pattern functions of base class, and put side-effects that solve the problem in rederived pattern functions of sub-class. Accelerates solution design, and promotes reusability. | Not available. | Not available. |
| subjunctive grammar | Use secondary regex to qualify primary regex with "match/not match" at the same time. Opens the door to Boolean Query, find-first, orthogonal side-effects, and more. | Not available. | Not available. |
| and grammar | Similar to the subjunctive but allows you to have side-effects that "fire" in both primary and secondary of the 'and' expression. | Not available. | Not available. |
| Document Transformation Keyword (reout) | In combination with DoPattern, allows any kind of transformation (replace, insert, delete, redirect) at any level of the regex. | Cannot solve complex problems without loop logic. Use global match/replace. | Same as with Perl. |
| Multiple lexers | Ptilde allows you to easily apply as many fine-grained "lexing/parsing" regexes to your stream as you wish, at any point in the parsing process. | Perl matches P~ in this capability. | Java's Matcher class makes it really difficult to apply more than one Matcher to the same CharSequence while remembering state between each find() or lookingAt() call. |
Scripting Language and Java Library
Ptilde's performance as a basic scripting language is comparable to the Java VM with its HotSpot compiler turned off, which is quite fast for a scripting language. It can therefore be used from the command line to perform document transformation and reporting tasks, as is commonly expected of a scripting language. The strength of P~, however, lies in its regex grammar and engine, so most users of P~ are expected to focus on problems best solved by regexes. These include converting documents (transformations), reporting on documents such as log files, matching and filtering text, searching for and extracting text, and the like.
The P~ engine is offered not only as a standalone executable, but also as a combination of Java classes (the translator) and a native JNI library (the engine) for professional use. That is, Java enterprise programmers will find it easy to convert a standalone command-line script solution into one that can be integrated into and called from a running Java application. In fact, P~ provides Java classes for running script instances, and those scripts run within the JVM thread-space and memory-space, not in a separate process.
As an agile language, P~ is best used for pattern matching, and the documentation demonstrates that P~ is one of the most powerful regex engines available. The reasons why P~ is so powerful as a regex engine include its performance as well as novel grammar forms that make P~ regexes more readable and reusable. In particular, the P~ engine is a DFA engine, not a backtracking NFA engine like those of Perl and java.util.regex. Despite being a DFA engine, it has introduced an unprecedented level of side-effects, so that the matching characteristics of a regex are more easily integrated with the problem-solving aspects. To the P~ programmer, this means that matching expressions need not be so unpleasantly intertwined with functional logic, as is normally the case with solutions rendered in Perl.
Why P~?
If you're a Java programmer, and therefore have access to the java.util.regex package, why would you consider deploying the P~ engine to help solve enterprise-class matching and transformation problems? There are several reasons. Consider the following:
- The P~ regex engine is written in native code and is therefore consistently faster than the Java engine for equivalent solutions
- The P~ regex engine uses deterministic finite automata and therefore its solutions involve no "backtracking", a significant advantage for solutions to difficult problems
- The P~ regex grammar is inherently more powerful (expressive) than the Java regular expression package and can be used to solve much more difficult matching and transformation problems
- The P~ regex grammar produces much more readable and maintainable solutions
- The P~ regex grammar has much more powerful side-effects (to the matching process) and therefore is more likely to be able to solve very difficult problems in a single pass through the entire document with all functional logic fully embedded in the regex
- A lot of thought has gone into the P~ interpreter and VM to minimize the impedance cost of using P~ scripts from Java programs, especially by making it easy to pass and return arguments between the Java app and P~ scripts, as well as to access Java classes from P~ scripts
Easy to Learn
P~ is remarkably easy to learn, especially for Java programmers. Its syntax for everything but regexes borrows extensively from Java. It is strongly typed. It has the same set of primitive scalar types as Java. Its statement and expression syntax is as in Java. Its means for declaring arrays, functions, structs and interfaces is similar to that of Java. It even allows for importing classes and interfaces from a host Java application. However, as an agile language, it has subtle differences relative to Java, such as introducing maps as a built-in data type, allowing arrays to grow automatically, and throwing fewer runtime exceptions.
But the above statements regard P~ as a standalone scripting language. Like Groovy, it also allows the programmer to use Java libraries from scripts. But the focus for P~ is to be a regex grammar and engine, and in this regard, regex programming has been made easier than ever before. The popular meta-character syntax of Perl and similar engines is not used. Instead, regexes in P~ are composed as normal algebraic expressions, using the standard composition operators of Java, with the attendant rules regarding associativity and precedence. As a result, P~ regexes are more readable and maintainable.
Scalable Regex Engine
Most regex engines are optimized for micro-benchmarks. That is, the regex being tested typically matches one line of data, and a loop is used to iterate through the document line by line.
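For comparison, the line-by-line style described above usually looks something like the following in plain java.util.regex. This is a minimal sketch, not P~ code: the log format, the class name LineLoopExample and the method countByLevel are assumptions made up for illustration. Note how the aggregation logic sits in the surrounding loop rather than in the regex.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineLoopExample {
    // Hypothetical log line format: "<LEVEL> <message>", e.g. "ERROR disk full".
    private static final Pattern LINE = Pattern.compile("^(\\w+)\\s+(.*)$");

    /** Counts lines per level. The regex only sees one line at a time. */
    static Map<String, Integer> countByLevel(BufferedReader in) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null) {      // loop drives the document
            Matcher m = LINE.matcher(line);           // regex matches a single line
            if (m.matches()) {
                counts.merge(m.group(1), 1, Integer::sum); // logic lives outside the regex
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String log = "INFO start\nERROR disk full\nINFO stop\n";
        System.out.println(countByLevel(new BufferedReader(new StringReader(log))));
    }
}
```

Benchmarks built around a loop like this mostly measure per-line matching cost, which is the sense in which they are "micro-benchmarks".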
P~ is optimized for solving complex problems involving large documents, in such a way that a single regex is composed that matches the entire document, and solves the problem via "side-effects". The speed of the P~ regex engine for document-level regexes is independent of document size, and largely independent of problem complexity. The inherent speed compares favorably to that of the Java regex engine, and though perhaps slower than Perl and Python for micro-benchmarks, will outperform these engines with document-level regexes that solve difficult, real problems.
The scalability advantages of P~ relative to other regex engines are the following: (1) most problems can be solved in P~ with one regex that makes one pass through the document; (2) the P~ engine never goes "super-linear" (as Perl, Python, Ruby, and most others can) when tackling a tough regex; (3) there is no need to bring the document into memory, because the regex is applied directly to the I/O stream (if the stream doesn't fit in memory, no problem!); (4) P~ is not a backtracking engine, so it visits each character in the stream just once; (5) a MemoryException is both rare and non-fatal, and just means that you should integrate the "lazy" composition technique into your regex; (6) you don't have to worry about pre-compiling and/or caching your regexes, as this is done for you automagically.
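Two of the points above, matching directly against an I/O stream and visiting each character exactly once, are general properties of DFA-style scanning and can be illustrated without P~ itself. The following is a hand-written Java sketch, not the P~ engine or its grammar: the class name, the four states, and the choice of counting /* ... */ block comments are all assumptions made for illustration. It deliberately ignores string literals and // line comments.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class StreamingDfaExample {
    // One state per position in the "/* ... */" recognition cycle.
    private enum State { CODE, SLASH, COMMENT, STAR }

    /** Counts complete block comments; each character is read and examined once. */
    static int countBlockComments(Reader in) throws IOException {
        State state = State.CODE;
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {   // no buffering of the whole document
            switch (state) {
                case CODE:
                    if (c == '/') state = State.SLASH;
                    break;
                case SLASH:               // saw '/', waiting for '*'
                    if (c == '*') state = State.COMMENT;
                    else if (c != '/') state = State.CODE;
                    break;
                case COMMENT:             // inside the comment body
                    if (c == '*') state = State.STAR;
                    break;
                case STAR:                // saw '*', waiting for '/'
                    if (c == '/') { count++; state = State.CODE; }
                    else if (c != '*') state = State.COMMENT;
                    break;
            }
        }
        return count;                     // an unterminated comment is not counted
    }

    public static void main(String[] args) throws IOException {
        String src = "int a; /* one */ int b; /** two **/ int c; /* open";
        System.out.println(countBlockComments(new StringReader(src)));
    }
}
```

Because the scanner only keeps the current state and a counter, the same method can be handed a buffered file or socket reader for input far larger than available memory, and its running time grows linearly with the input length.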
Thus, when addressing a difficult parsing, transformation, or tokenization problem, you, the programmer, focus on solving the problem correctly, employing all of the grammar advantages of P~. You can be confident that the P~ engine will perform for a hard problem as effectively as it handles a micro-benchmark.
Having said this, we have posted some benchmarks for simple problems that demonstrate that P~ is also a very fast regex engine.
Powerful Regex Grammar
P~ offers powerful matching expressions. Its grammatical powers largely match those of Perl and, in some areas such as Boolean Query, exceed them, despite the fact that it is a DFA engine. Its side-effect power is unprecedented. This allows functional statements that solve the problem associated with the match to be integrated directly into the regex composition. This feature is significantly more powerful than Perl's unique code assertion grammar.
Overall, the regex grammar powers are much more than syntactic "sugar". Strong evidence for this is the useful example that converts tabs to spaces in a Java source file. IDEs such as Eclipse use advanced parsers to do the job, but we can solve this problem robustly in a 140-line scriptlet. Take a quick look at the advanced example in the examples section if you want to see a problem that only one regex engine can solve!
Simple API for Java Integration
The primary enterprise use of P~ is to allow a Java application of any kind (including J2EE) to free itself from dependence on the java.util.regex package for matching problems. Not that this is a bad package; it is actually best of breed. However, its grammar is not nearly as powerful as Perl's, whereas the solvability offered by P~ rivals and in many cases exceeds that of Perl for regex problems. The recommended technique for using P~ from a Java application to solve matching problems (with side-effects!) is as follows. Write a scriptlet in P~ that does the matching chore. Test it in the script runner application that comes with the product. Deploy the scriptlet in the jar file of your Java application as a resource. Then write a simple Java class with a method that abstracts the chore, with as many parameters as you need to generalize the matching solution. This method makes a simple call to the run() or runobj() method of the p7e.engine.Script class, and then optionally returns the appropriate result, such as an array of Java Strings. This technique is well explained in the scriptlet section of the specification, which is a very good place to begin your evaluation of P~, because you will see the recommended design patterns for enterprise integration and get a sense of the solvability power of P~.
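The shape of the wrapper class described above can be sketched roughly as follows. This is only an outline under stated assumptions: the signatures of run() and runobj() on p7e.engine.Script are not given in this text, so the engine call is hidden behind a hypothetical ScriptletRunner interface, and the class name ScriptletFacade and method extract are likewise invented for illustration. A stub runner stands in for the engine so the sketch runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class ScriptletFacade {
    /** Hypothetical stand-in for the engine call. NOT the p7e.engine.Script API. */
    interface ScriptletRunner {
        String[] run(String scriptSource, String... args);
    }

    private final ScriptletRunner runner;
    private final String scriptSource;

    /** Reads the scriptlet text once, e.g. from a stream opened on a jar resource. */
    ScriptletFacade(ScriptletRunner runner, InputStream scriptlet) {
        this.runner = runner;
        try {
            this.scriptSource = new String(scriptlet.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The one method callers see: parameters in, plain Java strings out. */
    List<String> extract(String document, String keyword) {
        return Arrays.asList(runner.run(scriptSource, document, keyword));
    }

    public static void main(String[] args) {
        // Stub runner so the sketch runs without the P~ engine on the classpath.
        ScriptletRunner stub = (src, a) -> new String[] { src.length() + " chars", a[1] };
        InputStream scriptlet =
            new ByteArrayInputStream("// scriptlet text".getBytes(StandardCharsets.UTF_8));
        System.out.println(new ScriptletFacade(stub, scriptlet).extract("some document", "needle"));
    }
}
```

In a real application the InputStream would typically come from getResourceAsStream on the jar resource holding the scriptlet, and the runner would delegate to the engine as documented in the scriptlet section of the specification. Keeping the engine call behind one small interface also makes the matching chore easy to stub in unit tests.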
Prior to P~, Java programmers were at a distinct disadvantage to Perl and Python programmers when tackling tough regex matching and transformation problems via java.util.regex. Solutions run significantly slower in Java regex than equivalent solutions in Perl and Python. The syntax of Java regex, though Perl-like, has the disadvantage that backslashes must be doubled inside string literals, making regexes much less readable. And it is extremely difficult to incrementally compose a regex in Java that is suitable for one pass through the document, especially when complex side-effects are involved. In addition, Java regex can cause a "fatal" VM error with a tough regex (see the "isUTF8" problem).
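The backslash-doubling point is easy to see in a small, self-contained Java example. The class and field names here are made up for illustration; the java.util.regex calls are standard.

```java
import java.util.regex.Pattern;

class BackslashExample {
    // Regex as it would be written in Perl-style notation:  \d+\.\d+
    // Inside a Java string literal every backslash must be doubled.
    static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");

    // Matching one literal backslash takes four characters in Java source:
    // the regex is \\ and each of its two backslashes is doubled again.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    static boolean isDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDecimal("3.14"));                  // decimal number
        System.out.println(isDecimal("3x14"));                  // not a decimal number
        System.out.println(BACKSLASH.matcher("\\").matches());  // input is one backslash
    }
}
```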
But because the P~ engine is integrated into the Java platform so nicely, and because its regex engine has none of the defects of the Java regex engine, the combination is an ideal vehicle for solving tough transformation and matching problems in Java applications. With the combination of P~ engine and grammar, a Java application can effectively solve any matching or transformation problem that would otherwise recommend the use of Perl or Python!