If you can find a specific file that causes Tika to either run out of stack or use huge quantities of memory, it would be great to include it (if possible) in a TIKA jira ticket. We'd need a stack trace, of course, showing that Tika is responsible.
Thanks,
Karl

On Wed, Apr 15, 2015 at 11:51 AM, Kamil Żyta <[email protected]> wrote:

> On Wed, Apr 15, 2015 at 11:16:44AM -0400, Karl Wright wrote:
> > Hi Kamil,
> >
> > I bet that it is one specific file that was causing the problem. By
> > increasing the stack space, you allowed the file to be processed. Now it
> > won't get processed again until it changes.
> >
> > My thought is that this is *probably* related to Tika. Are you using the
> > Tika transformer?
>
> yes, I use Tika transformation and I think this is related to Tika too, but
> I don't know which file causes the problem. I have two identical jobs (one
> for continuous crawl and one for deletion), these jobs report different
> document counts and only the continuous job causes regex errors.
>
> Another job gives me "agents process ran out of memory - shutting down" but
> this is related to Tika too. Excluded one file and now it is working.
>
> K
>
> > On Wed, Apr 15, 2015 at 9:11 AM, Kamil Żyta <[email protected]> wrote:
> >
> > > I stopped all agents, removed all logs, added '-Xss500m' to the options
> > > file, started agents and the errors are gone. Now I removed '-Xss500m'
> > > from the options to trap the source of the problem, restarted all agents
> > > and still no errors.
> > >
> > > *magic*
> > >
> > > Thx Karl for your patience with my weird problems.
> > >
> > > K
> > >
> > > On Wed, Apr 15, 2015 at 08:39:52AM -0400, Karl Wright wrote:
> > > > Hi Kamil,
> > > >
> > > > I believe your logs are probably "rolling". This means that when the
> > > > log gets full, or another day starts, a new log file starts. I don't
> > > > know, of course, because I did not configure your system.
> > > >
> > > > What I *do* know is that the stack trace that you are providing me is
> > > > incomplete, and while it is clear that the Java regular expression
> > > > parser is failing in some way (by doing infinite recursion), I have no
> > > > idea what *context* this is occurring in, without the end of that
> > > > stack trace.
> > > >
> > > > This may be occurring almost anywhere, which is why I need the trace.
> > > > Even String.replace() and String.split() use regexps and can be at
> > > > fault. Without a definitive source, there's little I can do.
> > > >
> > > > One thing you can certainly try is to provide a larger amount of stack
> > > > space to the JVM and just hope the problem goes away. That would mean
> > > > editing one of the options files and adding a parameter:
> > > >
> > > > -Xss500m
> > > >
> > > > (for instance)
> > > >
> > > > If you would rather get to the source of the problem, I suggest the
> > > > following:
> > > >
> > > > (1) Shut down all agents processes
> > > > (2) Remove all logs
> > > > (3) Start the agents process
> > > > (4) Tail the log looking for "FATAL": tail -f manifoldcf.log | grep FATAL
> > > > (5) As soon as you see that, shut down the agents process
> > > > (6) Look at the log file produced
> > > >
> > > > References:
> > > >
> > > > http://stackoverflow.com/questions/7509905/java-lang-stackoverflowerror-while-using-a-regex-to-parse-big-strings
> > > >
> > > > Karl
> > > >
> > > > On Wed, Apr 15, 2015 at 8:28 AM, Kamil Żyta <[email protected]> wrote:
> > > >
> > > > > # java -version
> > > > > java version "1.8.0_45"
> > > > > Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
> > > > > Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
> > > > >
> > > > > it's broken? I don't know. How can I prevent rolling backtrace?
> > > > > It looks like an infinite loop to me.
> > > > >
> > > > > K
> > > > >
> > > > > On Wed, Apr 15, 2015 at 07:41:37AM -0400, Karl Wright wrote:
> > > > > > Clearly the logs must have rolled then? Either that or you are
> > > > > > using a broken jdk.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > > On Wed, Apr 15, 2015 at 7:37 AM, Kamil Żyta <[email protected]> wrote:
> > > > > >
> > > > > > > On Wed, Apr 15, 2015 at 07:27:56AM -0400, Karl Wright wrote:
> > > > > > > > Hi Kamil:
> > > > > > > >
> > > > > > > > kawright@duck76:/data/kawright/analysis$ gzip --version
> > > > > > > > gzip 1.4
> > > > > > > > Copyright (C) 2007 Free Software Foundation, Inc.
> > > > > > > > Copyright (C) 1993 Jean-loup Gailly.
> > > > > > > > This is free software. You may redistribute copies of it under the terms of
> > > > > > > > the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
> > > > > > > > There is NO WARRANTY, to the extent permitted by law.
> > > > > > > >
> > > > > > > > Written by Jean-loup Gailly.
> > > > > > > > kawright@duck76:/data/kawright/analysis$
> > > > > > > >
> > > > > > > > But in any case the key part of the stack trace is further down,
> > > > > > > > probably MUCH further down.
> > > > > > > >
> > > > > > > > If I were you, I'd unzip the whole log and use head, tail, and
> > > > > > > > grep to find where the exception trace ends.
> > > > > > >
> > > > > > > I use grep -v and send you logs before but you don't believe me.
> > > > > > > It's all mcf logs http://pastebin.com/T54NKwTh
> > > > > > > http://pastebin.com/uMxaUnGi
> > > > > > >
> > > > > > > K
> > > > > > >
> > > > > > > > On Wed, Apr 15, 2015 at 7:18 AM, Kamil Żyta <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > hmm, try tar -xf manifoldcf.log.gz or maybe zless?
> > > > > > > > > It works for me with:
> > > > > > > > > > gzip --version
> > > > > > > > > gzip 1.6
> > > > > > > > >
> > > > > > > > > For sure I attached uncompressed file.
> > > > > > > > >
> > > > > > > > > K
> > > > > > > > >
> > > > > > > > > On Wed, Apr 15, 2015 at 07:10:07AM -0400, Karl Wright wrote:
> > > > > > > > > > Hi Kamil,
> > > > > > > > > >
> > > > > > > > > > >>>>>>
> > > > > > > > > > kawright@duck76:~$ cd /data/kawright/analysis/
> > > > > > > > > > kawright@duck76:/data/kawright/analysis$ gunzip manifoldcf.log.gz
> > > > > > > > > >
> > > > > > > > > > gzip: manifoldcf.log.gz: invalid compressed data--crc error
> > > > > > > > > >
> > > > > > > > > > gzip: manifoldcf.log.gz: invalid compressed data--length error
> > > > > > > > > > kawright@duck76:/data/kawright/analysis$
> > > > > > > > > > <<<<<<
> > > > > > > > > >
> > > > > > > > > > Karl
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2015 at 6:41 AM, Kamil Żyta <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > these 1k lines are the same. I attached full manifoldcf.log.
> > > > > > > > > > >
> > > > > > > > > > > K
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 15, 2015 at 06:33:06AM -0400, Karl Wright wrote:
> > > > > > > > > > > > Hi Kamil,
> > > > > > > > > > > >
> > > > > > > > > > > > There is a complete trace in there, believe me. The JVM did not
> > > > > > > > > > > > say: "(...) ~1k lines". What I need is at the bottom of those
> > > > > > > > > > > > 1K lines.
> > > > > > > > > > > >
> > > > > > > > > > > > Karl
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 15, 2015 at 6:23 AM, Kamil Żyta <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > How can I provide usable stack trace?
> > > > > > > > > > > > > I can only copy what the logs say.
> > > > > > > > > > > > > Now it's a lot of:
> > > > > > > > > > > > > FATAL 2015-04-15 12:14:35,645 (Worker thread '5') - Error tossed: null
> > > > > > > > > > > > > java.lang.StackOverflowError
> > > > > > > > > > > > >         at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
> > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4250)
> > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
> > > > > > > > > > > > >         (...) ~1k lines
> > > > > > > > > > > > >
> > > > > > > > > > > > > for continuous job but agents is not exiting. Probably these two
> > > > > > > > > > > > > errors below aren't correlated (patterns and agents oom).
> > > > > > > > > > > > >
> > > > > > > > > > > > > K
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Apr 14, 2015 at 05:28:18PM -0400, Karl Wright wrote:
> > > > > > > > > > > > > > Without some kind of usable stack trace I can't really help
> > > > > > > > > > > > > > you. It looks like some regular expression is going
> > > > > > > > > > > > > > completely haywire, but I have no idea which one.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Karl
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Apr 14, 2015 at 4:31 PM, Kamil Żyta <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Apr 14, 2015 at 04:12:55PM -0400, Karl Wright wrote:
> > > > > > > > > > > > > > > > Hi Kamil,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Without the bottom of the stack trace, I can't even tell
> > > > > > > > > > > > > > > > what it is doing. Where are you supplying a regular expression?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's all I have, the only regular expression is in 'Paths':
> > > > > > > > > > > > > > > 3. Exclude file(s) or directory(s) matching */.*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I found files (~500MB, logs) where solr logs ends,
> > > > > > > > > > > > > > > exclude them solves the problem. mcf use tika for extracting
> > > > > > > > > > > > > > > and only /update to solr, these files causes problem before,
> > > > > > > > > > > > > > > when using solr for extract docs. Now mcf dies and I do not
> > > > > > > > > > > > > > > even know why.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > K
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Running out of memory might be a side effect of running
> > > > > > > > > > > > > > > > out of stack.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Karl
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Apr 14, 2015 at 2:49 PM, Kamil Żyta <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > agent process exit with:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > agents process ran out of memory - shutting down
> > > > > > > > > > > > > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > > > > > > > > > > > > >         at java.util.Arrays.copyOfRange(Arrays.java:3664)
> > > > > > > > > > > > > > > > >         at java.lang.String.<init>(String.java:201)
> > > > > > > > > > > > > > > > >         at java.lang.StringBuilder.toString(StringBuilder.java:407)
> > > > > > > > > > > > > > > > >         at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.buildSolrDocument(HttpPoster.java:987)
> > > > > > > > > > > > > > > > >         at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:882)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > workers threads:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > FATAL 2015-04-14 18:59:11,172 (Worker thread '32') - Error tossed: null
> > > > > > > > > > > > > > > > > java.lang.StackOverflowError
> > > > > > > > > > > > > > > > >         at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
> > > > > > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4250)
> > > > > > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
> > > > > > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
> > > > > > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
> > > > > > > > > > > > > > > > >         (...) ~1k lines
> > > > > > > > > > > > > > > > >         at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > no errors/warns in solr logs.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > it's bug or just corrupted file?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > K
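
For readers trying to reproduce the symptom discussed in this thread: the sketch below is a minimal, standalone illustration of how a regular expression applied to a very long input can exhaust the thread stack, and how a larger stack (the effect of the -Xss option suggested above) may let the same match finish. Everything in it is an assumption made for illustration: the class name `RegexStackDemo`, the pattern `(a|b)*`, and the input size are hypothetical and are not taken from ManifoldCF or Tika. Whether the default-stack attempt actually overflows depends on the JVM version and its default stack size, and the stack size passed to `Thread` is only a hint that some platforms ignore.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Pattern;

public class RegexStackDemo {
    /** Possible outcomes of one match attempt. */
    enum Result { MATCHED, NOT_MATCHED, STACK_OVERFLOW }

    /**
     * Runs pattern.matcher(input).matches() on a fresh thread.
     * stackBytes is the requested stack size in bytes (0 means "JVM default").
     */
    static Result matchOnThread(Pattern pattern, String input, long stackBytes) {
        AtomicReference<Result> out = new AtomicReference<>();
        Runnable task = () -> {
            try {
                out.set(pattern.matcher(input).matches() ? Result.MATCHED : Result.NOT_MATCHED);
            } catch (StackOverflowError e) {
                // The regex engine recursed too deeply for this thread's stack.
                out.set(Result.STACK_OVERFLOW);
            }
        };
        Thread t = new Thread(null, task, "regex-demo", stackBytes);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the match", e);
        }
        return out.get();
    }

    public static void main(String[] args) {
        // Hypothetical pattern: an alternation inside a repeated group.
        Pattern p = Pattern.compile("(a|b)*");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append('a');
        }
        String big = sb.toString();
        System.out.println("default stack: " + matchOnThread(p, big, 0));
        System.out.println("512 MB stack:  " + matchOnThread(p, big, 512L * 1024 * 1024));
    }
}
```

If the first line reports STACK_OVERFLOW and the second MATCHED, that mirrors what was seen here: the input is not necessarily corrupt, it is just large enough that the match needs more stack than the thread has.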

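
Much of this thread is spent looking for the end of a stack trace buried under roughly a thousand identical frames. A small helper that collapses runs of identical consecutive lines makes the non-repeating frames easy to spot. This is only a sketch: the class name `TraceCollapser` and the default file name are assumptions, and it expects the log to be readable as UTF-8. Note also that if the bottom frames never show up at all, one possible explanation (an assumption on my part, not something established in the thread) is that HotSpot caps the number of frames it records per exception, which the `-XX:MaxJavaStackTraceDepth` option controls.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class TraceCollapser {
    /** Collapses each run of identical consecutive (trimmed) lines into one line plus a count. */
    static List<String> collapse(Iterable<String> lines) {
        List<String> out = new ArrayList<>();
        String previous = null;
        int run = 0;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals(previous)) {
                run++;              // same frame again, just count it
                continue;
            }
            if (previous != null) {
                out.add(format(previous, run));
            }
            previous = line;
            run = 1;
        }
        if (previous != null) {
            out.add(format(previous, run));
        }
        return out;
    }

    private static String format(String line, int run) {
        return run == 1 ? line : line + "  [x" + run + "]";
    }

    public static void main(String[] args) throws IOException {
        // The path comes from the command line; "manifoldcf.log" is only an assumed default.
        String path = args.length > 0 ? args[0] : "manifoldcf.log";
        // Files.lines reads lazily, so the whole log is never held in memory at once.
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            collapse(lines::iterator).forEach(System.out::println);
        }
    }
}
```

Run against a log like the ones quoted above, the repeated `Pattern$Curly.match0` frames shrink to a single counted line, and whatever frames follow them (the calling context being asked for) are printed right after.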