Re: Crawling and indexing very slow

Karl Wright Thu, 31 Jul 2014 13:25:38 -0700

Hi Ameya,

You cannot just comment out that line; instead you must supply an input
stream.  But you can create a null input stream, for example:


data.setBinary(new ByteArrayInputStream(new byte[0]),0);
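
A minimal, JDK-only sketch of why the empty stream is safe to hand over: a consumer that drains it sees end-of-stream immediately and reads zero bytes, which matches the declared length of 0. The class name EmptyStreamSketch and the helpers emptyStream() and countBytes() are hypothetical names used only for illustration; they are not part of ManifoldCF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class EmptyStreamSketch {

    // Hypothetical helper: a stream backed by a zero-length array, so it has no content.
    static ByteArrayInputStream emptyStream() {
        return new ByteArrayInputStream(new byte[0]);
    }

    // Drains a stream the way a downstream consumer might, counting the bytes seen.
    static long countBytes(InputStream in) {
        long total = 0;
        byte[] buffer = new byte[4096];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // The empty stream reports end-of-stream on the first read.
        System.out.println("bytes read: " + countBytes(emptyStream()));
    }
}
```

In the connector itself, the equivalent would presumably be the single setBinary call above, with the length argument kept at 0 so that it agrees with the stream.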

Karl


On Thu, Jul 31, 2014 at 4:22 PM, Ameya Aware <[email protected]> wrote:

> >>>>>>>>>>>>>>>>>>>>>>>>>>
>                     long fileBytes = file.length();
>                     RepositoryDocument data = new RepositoryDocument();
>                     data.setBinary(is,fileBytes);
>                     String fileName = file.getName();
>                     data.setFileName(fileName);
>                     data.setMimeType(mapExtensionToMimeType(fileName));
>
> <<<<<<<<<<<<<<<<<<<<<<<<<<<
>
>
> Do I just need to comment out the third line, i.e.
> data.setBinary(is,fileBytes); ?
>
>
> Thanks,
> Ameya
>
>
> On Thu, Jul 31, 2014 at 4:17 PM, Ameya Aware <[email protected]>
> wrote:
>
>> I could not exactly locate the position where this is happening.
>>
>> Can you please help me out with the changes?
>>
>> Thanks,
>> Ameya
>>
>>
>>
>> On Thu, Jul 31, 2014 at 4:10 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Ameya,
>>>
>>> Since you are already modifying the connector for your purposes, nothing
>>> is stopping you from modifying it further to not fetch the document and
>>> instead substitute an empty input stream.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Thu, Jul 31, 2014 at 3:03 PM, Ameya Aware <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have modified the code a little to add different metadata fields,
>>>> such as those below (FileConnector.java):
>>>>
>>>>                     data.addField("created", new Date((attr.creationTime().toMillis())));
>>>>                     data.addField("last_accessed", new Date(attr.lastAccessTime().toMillis()));
>>>>                     data.addField("last_modified", new Date(file.lastModified()));
>>>>                     data.addField("size", file.length());
>>>>
>>>>
>>>> which are being passed to Solr.
>>>>
>>>> Now, can I stop MCF from reading each file and sending its content,
>>>> and just pass the above information to Solr?
>>>>
>>>>
>>>> Thanks,
>>>> Ameya
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 2:57 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Ameya,
>>>>>
>>>>> The file system connector does not retrieve any metadata for a
>>>>> document at all.  So I'm not sure what metadata you are talking about.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 31, 2014 at 2:44 PM, Ameya Aware <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The thing is, I am not looking for the data or content of any of
>>>>>> the files. I am only interested in each file's metadata.
>>>>>>
>>>>>> So I thought it should be possible to skip reading the files, get
>>>>>> just their metadata, and give that to Solr.
>>>>>>
>>>>>> This should save lots of time.
>>>>>>
>>>>>> Is it possible to do this?
>>>>>>
>>>>>> Thanks,
>>>>>> Ameya
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ameya,
>>>>>>>
>>>>>>> (1) Please look at the Simple History report.  Note what kinds of
>>>>>>> documents are being fetched, what kinds are being indexed, and how
>>>>>>> long it is taking.  I have noted from your previous posts that you
>>>>>>> seem to be indexing a lot of very large EXE files.  This is useless
>>>>>>> and you should be excluding them.
>>>>>>>
>>>>>>> (2) Please look in the manifoldcf.log file for evidence that fetches
>>>>>>> and/or Solr indexing requests are being retried due to errors.  It
>>>>>>> doesn't take many documents being chronically retried before forward
>>>>>>> progress drops to near zero.
>>>>>>>
>>>>>>> (3) If you look into (1) & (2) and everything seems fine, it may be
>>>>>>> a misalignment between availability of several kinds of resources
>>>>>>> that is the problem.  Please get a thread dump of the agents process
>>>>>>> while it is crawling, using jstack.  Post that thread dump and we can
>>>>>>> tell you what to look at next.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> I am using the filesystem connector to index my entire C drive, with
>>>>>>>> Solr as the output connector.
>>>>>>>>
>>>>>>>> The initial 100000 documents were crawled and indexed successfully in
>>>>>>>> a couple of hours, but after that indexing slowed down badly (to
>>>>>>>> around 15-20 documents per minute).
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not able to figure out whether the issue is with MCF or Solr.
>>>>>>>>
>>>>>>>>
>>>>>>>> Can you advise me on how to proceed with this?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ameya
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
