Re: Errors on master after upgrading to 4.10.3

2016-02-17 Thread Joseph Hagerty
Ahh, makes sense. I did have a feeling I was barking up the wrong tree
since it's an Extraction issue, but I thought I'd throw it out there,
anyway.

Thanks so much for the information!

On Wed, Feb 17, 2016 at 4:49 PM, Rachel Lynn Underwood <
r.lynn.underw...@gmail.com> wrote:

> This is an error being thrown by Apache PDFBox/Tika. You're seeing it now
> because Solr 4.x uses a different Tika version than Solr 3.x.
>
> It looks like this error is thrown when you parse a PDF with Tika, and a
> font in that PDF doesn't have a ToUnicode mapping.
> https://issues.apache.org/jira/browse/PDFBOX-1408
>
> Another user reported that this might be related to special characters, but
> PDFBox developers haven't been able to reproduce the bug.
> https://issues.apache.org/jira/browse/PDFBOX-1706
>
> Since this isn't an issue in the Solr code, if you're concerned about it,
> you'll probably have better luck asking the PDFBox developers directly, via
> Jira or their mailing list.
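[Editor's note: given the explanation above, one pragmatic workaround is to silence the noisy PDFBox logger in Solr's log4j.properties rather than patch anything. This is a sketch, not from the thread; the logger category is assumed from the "PDSimpleFont" class name in the message, so match it against your own logs, and note Solr 4.10 logs through log4j by default:]

```
# log4j.properties — suppress the PDSimpleFont width warnings
# (logger name assumed; the message is logged at ERROR, so FATAL hides it)
log4j.logger.org.apache.pdfbox.pdmodel.font.PDSimpleFont=FATAL
```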
>
>
> On Tue, Feb 16, 2016 at 12:08 PM, Joseph Hagerty  wrote:
>
> > Does literally nobody else see this error in their logs? I see this error
> > hundreds of times per day, in occasional bursts. Should I file this as a
> > bug?
> >
> > > On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty wrote:
> >
> > > After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> > > alarming regularity in the master's error log:
> > >
> > > 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
> > > space character using 250 as default
> > >
> > > I can't seem to glean much information about this one from the web. Has
> > > anyone else fought this error?
> > >
> > > In case this helps, here's some technical/miscellaneous info:
> > >
> > > - I'm running a master-slave set-up.
> > >
> > > - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> > > from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> > > this, but I don't know the first thing about it.
> > >
> > > - I have the clients specifying 'autocommit=6s' in their requests, which I
> > > realize is a pretty aggressive commit interval, but so far that hasn't
> > > caused any problems I couldn't surmount.
> > >
> > > - There are north of 11 million docs in my index, which is 36 gigs thick.
> > > The storage volume is only 10% full.
> > >
> > > - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
> > > to incompatibility between versions.
> > >
> > > - Both master and slave are running on AWS instances, C4.4XL's (16 cores,
> > > 30 gigs of RAM).
> > >
> > > So far, I have been unable to reproduce this error on my own: I can only
> > > observe it in the logs. I haven't been able to tie it to any specific
> > > document.
> > >
> > > Let me know if further information would be helpful.
> > >
> > >
> > >
> > >
> >
> >
> > --
> > - Joe
> >
>



-- 
- Joe


Re: Errors on master after upgrading to 4.10.3

2016-02-16 Thread Joseph Hagerty
Does literally nobody else see this error in their logs? I see this error
hundreds of times per day, in occasional bursts. Should I file this as a
bug?

On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty  wrote:

> After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> alarming regularity in the master's error log:
>
> 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
> space character using 250 as default
>
> I can't seem to glean much information about this one from the web. Has
> anyone else fought this error?
>
> In case this helps, here's some technical/miscellaneous info:
>
> - I'm running a master-slave set-up.
>
> - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> this, but I don't know the first thing about it.
>
> - I have the clients specifying 'autocommit=6s' in their requests, which I
> realize is a pretty aggressive commit interval, but so far that hasn't
> caused any problems I couldn't surmount.
>
> - There are north of 11 million docs in my index, which is 36 gigs thick.
> The storage volume is only 10% full.
>
> - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
> to incompatibility between versions.
>
> - Both master and slave are running on AWS instances, C4.4XL's (16 cores,
> 30 gigs of RAM).
>
> So far, I have been unable to reproduce this error on my own: I can only
> observe it in the logs. I haven't been able to tie it to any specific
> document.
>
> Let me know if further information would be helpful.
>
>
>
>


-- 
- Joe


Errors on master after upgrading to 4.10.3

2016-02-15 Thread Joseph Hagerty
After migrating from 3.5 to 4.10.3, I'm seeing the following error with
alarming regularity in the master's error log:

2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
space character using 250 as default

I can't seem to glean much information about this one from the web. Has
anyone else fought this error?

In case this helps, here's some technical/miscellaneous info:

- I'm running a master-slave set-up.

- I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext from
.docs and .pdfs. I'm guessing that PDSimpleFont is a component of this, but
I don't know the first thing about it.

- I have the clients specifying 'autocommit=6s' in their requests, which I
realize is a pretty aggressive commit interval, but so far that hasn't
caused any problems I couldn't surmount.

- There are north of 11 million docs in my index, which is 36 gigs thick.
The storage volume is only 10% full.

- When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
to incompatibility between versions.

- Both master and slave are running on AWS instances, C4.4XL's (16 cores,
30 gigs of RAM).

So far, I have been unable to reproduce this error on my own: I can only
observe it in the logs. I haven't been able to tie it to any specific
document.

Let me know if further information would be helpful.


Re: JVM heap constraints and garbage collection

2014-01-31 Thread Joseph Hagerty
Thanks, Shawn. This information is actually not all that shocking to me.
It's always been in the back of my mind that I was "getting away with
something" in serving from the m1.large. Remarkably, however, it has served
me well for nearly two years; also, although the index has not always been
30GB, it has always been much larger than the RAM on the box. As you
suggested, I can only suppose that usage patterns and the index schema have
in some way facilitated minimal heap usage, up to this point.

For now, we're going to increase the heap size on the instance and see
where that gets us; if it still doesn't suffice for now, then we'll upgrade
to a more powerful instance.

Michael, thanks for weighing in. Those i2 instances look delicious indeed.
Just curious -- have you struggled with garbage collection pausing at all?



On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey  wrote:

> On 1/30/2014 3:20 PM, Joseph Hagerty wrote:
>
>> I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.
>>
>
> 
>
>
>> - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
>>
>
> One detail that you did not provide was how much of your 7.5GB RAM you are
> allocating to the Java heap for Solr, but I actually don't think I need
> that information, because for your index size, you simply don't have
> enough. If you're sticking with Amazon, you'll want one of the instances
> with at least 30GB of RAM, and you might want to consider more memory than
> that.
>
> An ideal RAM size for Solr is equal to the size of on-disk data plus the
> heap space used by Solr and other programs.  This means that if your java
> heap for Solr is 4GB and there are no other significant programs running on
> the same server, you'd want a minimum of 34GB of RAM for an ideal setup
> with your index.  4GB of that would be for Solr itself, the remainder would
> be for the operating system to fully cache your index in the OS disk cache.
>
> Depending on your query patterns and how your schema is arranged, you
> *might* be able to get away as little as half of your index size just for
> the OS disk cache, but it's better to make it big enough for the whole
> index, plus room for growth.
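[Editor's note: Shawn's rule of thumb can be written out as a quick calculation. A sketch only — the function names are mine; the 30GB index and 4GB heap figures are from his example above:]

```python
def ideal_ram_gb(index_gb, solr_heap_gb, other_gb=0.0):
    # Ideal: all heaps plus enough free RAM for the OS to cache the whole index.
    return index_gb + solr_heap_gb + other_gb

def floor_ram_gb(index_gb, solr_heap_gb, other_gb=0.0):
    # Optimistic floor: only about half the index held in the OS disk cache.
    return index_gb / 2 + solr_heap_gb + other_gb

print(ideal_ram_gb(30, 4))  # 34 -> hence "a minimum of 34GB of RAM"
print(floor_ram_gb(30, 4))  # 19.0
```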
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Many people are *shocked* when they are told this information, but if you
> think about the relative speeds of getting a chunk of data from a hard disk
> vs. getting the same information from memory, it's not all that shocking.
>
> Thanks,
> Shawn
>
>


-- 
- Joe


JVM heap constraints and garbage collection

2014-01-30 Thread Joseph Hagerty
Greetings esteemed Solr-ites,

I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.

Since my average load during peak hours is becoming quite high, and since
I'm finally starting to notice a little bit of performance degradation and
intermittent errors (e.g. "Solr returned response 0" on perfectly valid
reads during load spikes), I think it's time to tune my Slave box before
things get out of control.

In particular, *I am curious how others are tuning their JVM heap
constraints (-Xms, -Xmx, etc.) and garbage collection (parallel or
concurrent) to meet the needs of Solr*. I am using the Sun JVM Version 6,
not the fancy third party offerings.
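[Editor's note: for concreteness, a common starting point of that era for Sun JVM 6 under Tomcat looked like the following. This is an illustrative sketch, not a recommendation tuned to this index; the heap sizes are placeholders:]

```
# e.g. in Tomcat's bin/setenv.sh
JAVA_OPTS="$JAVA_OPTS -Xms2g -Xmx2g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+CMSParallelRemarkEnabled"
```

Pinning -Xms equal to -Xmx avoids heap-resize pauses, and CMS trades some throughput for shorter stop-the-world collections, which usually suits a read-heavy slave.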

Some more info, FWIW:

- Average document size in my index is probably around 6k
- Using CentOS
- Master-Slave setup. Master gets all the writes, Slave gets all the read
requests. It is the *Slave* that is suffering-- the Master seems fine.
- The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
- DaemonThreads skyrocket during the aforementioned load spikes

Thanks for reading, and to the devs: thanks for an excellent product.

-- 
- Joe


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
How interesting! You know, I did at one point consider that perhaps the
fieldname "meta" may be treated specially, but I talked myself out of it. I
reasoned that a field name in my local schema should have no bearing on how
a plugin such as solr-cell/Tika behaves. I should have tested my
hypothesis; even if this phenomenon turns out to be undocumented behavior,
I consider myself a victim of my own assumptions.

I am running version 3.5. You may have gotten the multivalue errors due to
the way your test schema and/or extracting request handler is laid out (my
bad). I am using the "ignored" fieldtype and a dynamicField called
"ignored_" as a catch-all for extraneous fields delivered by Tika.

Thanks for your help! Please keep me posted on any further
insights/revelations, and I'll do the same.
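[Editor's note: the "ignored" fieldtype / dynamicField catch-all Joseph describes is the one from the stock example schema.xml; it looks roughly like this, reproduced from memory, so verify against your Solr version:]

```
<fieldtype name="ignored" stored="false" indexed="false"
           multiValued="true" class="solr.StrField"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```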

On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky wrote:

> I did some testing, and evidently the "meta" field is treated specially
> from the ERH.
>
> I copied the example schema, and added both "meta" and "metax" fields and
> set "fmap.content=metax", and lo and behold only the doc content appears in
> "metax", but all the doc metadata appears in "meta".
>
> Although, I did get 400 errors with Solr complaining that "meta" was not a
> multivalued field. This is with Solr 3.6. What release of Solr are you
> using?
>
> I was not aware of this undocumented feature. I haven't checked the code
> yet.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 11:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtractRH: How to strip metadata
>
>
> I do not. I commented out all of the copyFields provided in the default
> schema.xml that ships with 3.5. My schema is rather minimal. Here is my
> fields block, if this helps:
>
> 
>   required="true"  />
>   required="true"  />
>   required="true"  />
>   required="true"  />
>  
>  
> 
>
>
> On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky wrote:
>
>> Check to see if you have a CopyField for a wildcard pattern that copies to
>> "meta", which would copy all of the Tika-generated fields to "meta."
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Joseph Hagerty
>> Sent: Wednesday, May 02, 2012 9:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: ExtractRH: How to strip metadata
>>
>>
>> Greetings Solr folk,
>>
>> How can I instruct the extract request handler to ignore metadata/headers
>> etc. when it constructs the "content" of the document I send to it?
>>
>> For example, I created an MS Word document containing just the word
>> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
>> server, here's what's thrown in the index:
>>
>> 
>> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
>> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
>> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author
>> Jesus
>> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
>> 2008-11-05T20:19:00Z stream_content_type application/octet-stream
>> Character
>> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
>> Company Parkman Elastomers Pvt Ltd Content-Type application/msword
>> Keywords
>> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
>> 
>>
>> All I want is the body of the document, in this case the word
>> "SEARCHWORD."
>>
>> For further reference, here's my extraction handler:
>>
>> <requestHandler name="/update/extract" startup="lazy"
>>     class="solr.extraction.ExtractingRequestHandler">
>>   <lst name="defaults">
>>     <str name="fmap.content">meta</str>
>>     <str name="lowernames">true</str>
>>     <str name="uprefix">ignored_</str>
>>   </lst>
>> </requestHandler>
>>
>> (Ironically, "meta" is the field in the solr schema to which I'm
>> attempting
>> to extract the body of the document. Don't ask).
>>
>> Thanks in advance for any pointers you can provide me.
>>
>> --
>> - Joe
>>
>>
>
>
> --
> - Joe
>



-- 
- Joe


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
I do not. I commented out all of the copyFields provided in the default
schema.xml that ships with 3.5. My schema is rather minimal. Here is my
fields block, if this helps:

 
   
   
   
   
   
   
 


On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky wrote:

> Check to see if you have a CopyField for a wildcard pattern that copies to
> "meta", which would copy all of the Tika-generated fields to "meta."
>
> -- Jack Krupansky
>
> -----Original Message- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: ExtractRH: How to strip metadata
>
>
> Greetings Solr folk,
>
> How can I instruct the extract request handler to ignore metadata/headers
> etc. when it constructs the "content" of the document I send to it?
>
> For example, I created an MS Word document containing just the word
> "SEARCHWORD" and nothing else. However, when I ship this doc to my solr
> server, here's what's thrown in the index:
>
> 
> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
> stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
> Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
> Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
> 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
> Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
> Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
> Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
> 
>
> All I want is the body of the document, in this case the word "SEARCHWORD."
>
> For further reference, here's my extraction handler:
>
> <requestHandler name="/update/extract" startup="lazy"
>     class="solr.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <str name="fmap.content">meta</str>
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> (Ironically, "meta" is the field in the solr schema to which I'm attempting
> to extract the body of the document. Don't ask).
>
> Thanks in advance for any pointers you can provide me.
>
> --
> - Joe
>



-- 
- Joe


ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers
etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word
"SEARCHWORD" and nothing else. However, when I ship this doc to my solr
server, here's what's thrown in the index:


Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD


All I want is the body of the document, in this case the word "SEARCHWORD."

For further reference, here's my extraction handler:

<requestHandler name="/update/extract" startup="lazy"
    class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

(Ironically, "meta" is the field in the solr schema to which I'm attempting
to extract the body of the document. Don't ask).
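[Editor's note: for context, a request against an extraction handler like this is typically made with curl, along these lines — the host, core path, and literal.id value are placeholders, and the handler path must match the name in solrconfig.xml:]

```
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@searchword.doc"
```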

Thanks in advance for any pointers you can provide me.

-- 
- Joe